Tuesday, February 06, 2018

Introduction

Introduction

  • This presentation intends to introduce the basic concepts and tools of data analysis and linear regression in R, such as
    • importing data from an excel file
    • subsetting and manipulating the data
    • descriptive statistics
    • exploratory data analysis
    • linear regression
  • At the end of this presentation, you will
    • know how to import data from and export data to Excel files
    • have a basic knowledge of data analysis
    • perform statistical analysis
    • conduct basic exploratory data and linear regression analyses
  • In developing this presentation and its examples, I assume no background in computer programming; only a little knowledge of statistics, matrix algebra, and data structures is required.

Necessary Packages

  • First of all, let's install and load some of the necessary R packages we will use in this presentation.
  • The following example shows how to install and load an R package.
install.packages("rmarkdown")  ## Installing packages.
library("rmarkdown")  ## Loading packages.
  • In the development of this document the following packages are used: devtools, rmarkdown, knitr, checkpoint, pastecs, psych, magrittr, ggplot2, plotly, gapminder, stargazer, leaflet, DT, gvlma.
    • These packages are pre-installed and loaded for you as long as the R interactive session is started from the R mini BootCamp.Rproj file, the main file for this workshop.
  • Also, the development version of the ggplot2 package is installed from GitHub.
devtools::install_github('hadley/ggplot2')
library("ggplot2")

Data

Importing Data

Data and Metadata

  • In this presentation, we will use cross-sectional data on wages.
  • The data and the corresponding metadata are contained in the wage1.xls and wage1_metadata.txt files, respectively.
    • Before importing data, it is very important to open and perform a visual check on both the raw data and metadata files.
    • Now, go to the R/Repo/Data folder and open both files.

Importing Data

Necessary Package

  • Since the raw data file, wage1.xls, is an Excel file, we need to use an R package that is capable of importing Excel files.
  • In R, there are several packages for importing data from Excel files, such as XLConnect, xlsx, and readxl.
  • In this presentation, we will use only the XLConnect package.
  • The first step in importing an Excel file is installing and loading the XLConnect package.
install.packages("XLConnect") ## Installing.
library("XLConnect") ## Loading.

Importing Data

Methods

  • The second step is importing the data from the Excel file.
  • You can import the data in two different ways.

Method 1

data <- readWorksheetFromFile(file = "file.path", ...)

Method 2

  • First, use the loadWorkbook function to load an Excel file as a workbook object.
  • Second, use the readWorksheet function to read data from the workbook.
workbook <- loadWorkbook(filename = "file.path")
data <- readWorksheet(object = workbook, sheet = "Sheet Name or Index", ...)

Importing Data

Methods

  • Let's import the wage1.xls Excel file with both methods.
  • Note the sheet and header arguments in both methods.
    • The sheet argument specifies the name or index of the sheet you want to load from the Excel file.
    • The header argument specifies whether the first row should be used as column names.
  • You can also use other arguments to import a certain region of a sheet in the Excel file.
    • Typing ?readWorksheet in the console shows the other arguments you can use.

Method 1

data <- readWorksheetFromFile(file = paste0(data.path, "wage1.xls"), sheet = "WAGE1", header = FALSE)

Method 2

workbook <- loadWorkbook(paste0(data.path, "wage1.xls"))
data <- readWorksheet(workbook, sheet = "WAGE1", header = FALSE)

Printing Data

  • Let's print the imported data.
  • As can be seen from the dynamic paged data table below,
    • the data has 526 rows (observations) and 24 columns (variables)
    • all variables are of numeric class, indicated by the <dbl> tag in the table
    • the column names are generic and need to be renamed for further use.
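The interactive table itself does not survive in text form, but the same quick checks can be done in any R session. A minimal sketch, using a tiny stand-in data frame (the real `data` imported above has 526 rows and 24 numeric columns named Col1, ..., Col24):

```r
## Stand-in for the imported data (illustrative values only).
data <- data.frame(Col1 = c(3.10, 3.24, 3.00), Col2 = c(11, 12, 11))
head(data)           ## Prints the first rows.
dim(data)            ## Number of rows and columns.
sapply(data, class)  ## Class of each column; all "numeric" here.
```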

Subsetting Data

  • In this presentation we will use only a part of the imported data.
  • Thus, before manipulating the data, we will subset the raw data with specific variables only.
  • We will be particularly interested in the following variables
    • wage: average hourly earnings (Col1)
    • educ: years of education (Col2)
    • exper: years potential experience (Col3)
    • tenure: years with current employer (Col4)
    • female: =1 if female (Col6)
data <- data[, c(1:4, 6)] ## Subsetting with column index numbers.
# data <- data[, c("Col1", "Col2", "Col3", "Col4", "Col6")] ## Subsetting with column names.

Manipulating Data

  • Since the column names are generic (i.e., Col1, Col2, …, Col24), we need to assign a name to each variable.
colnames(data) <- c("Wage", "Education", "Experience", "Tenure", "Gender") ## New names.
  • For further use, let's perform the following data manipulations.
    • convert Gender variable to factor class where Male is the base level.
    • create Ln.Wage variable, natural logarithm of Wage.
    • create Experience.Sq variable, square of Experience.
data$Gender <- factor(data$Gender, labels = c("Male", "Female")) ## "0" is Male, "1" is Female.
data$Ln.Wage <- log(data$Wage) ## Column binds as the last column.
data$Experience.Sq <- data$Experience^2 ## Column binds as the last column.
  • Finally, let's re-arrange the column order.
col.order <- c("Wage", "Ln.Wage", "Education", "Experience", "Experience.Sq", "Tenure", "Gender")
data <- data[, col.order]

Manipulating Data

  • Now, let's print the subsetted and manipulated data, i.e., modified data.

Missing Values

  • Let's check for missing values in each column before computing the descriptive statistics of the modified data.
    • In R, missing values are represented by NA.
    • Use the is.na function to check whether a value is missing.
sapply(data, function(missing) sum(is.na(missing)))  ## Checks number of missing values for all columns.
#>          Wage       Ln.Wage     Education    Experience Experience.Sq 
#>             0             0             0             0             0 
#>        Tenure        Gender 
#>             0             0
sum(sapply(data, function(missing) sum(is.na(missing))))  ## Checks number of missing values in all columns.
#> [1] 0
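No values are missing here, but had any been found, base R offers complete.cases and na.omit for handling them. A small sketch on a toy data frame (the variable names are illustrative):

```r
## Toy data frame with one missing value (illustrative).
toy <- data.frame(Wage = c(3.10, NA, 5.30), Education = c(11, 12, 8))
complete.cases(toy)           ## FALSE for rows containing any NA.
toy.clean <- na.omit(toy)     ## Drops rows with missing values.
nrow(toy.clean)               ## Two complete rows remain.
mean(toy$Wage, na.rm = TRUE)  ## Many functions also take na.rm = TRUE.
```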

Structure of Data

  • Let's check the structure of the modified data.
str(data)
#> 'data.frame':    526 obs. of  7 variables:
#>  $ Wage         : num  3.1 3.24 3 6 5.3 ...
#>  $ Ln.Wage      : num  1.13 1.18 1.1 1.79 1.67 ...
#>  $ Education    : num  11 12 11 8 12 16 18 12 12 17 ...
#>  $ Experience   : num  2 22 2 44 7 9 15 5 26 22 ...
#>  $ Experience.Sq: num  4 484 4 1936 49 ...
#>  $ Tenure       : num  0 2 0 28 2 8 7 3 4 21 ...
#>  $ Gender       : Factor w/ 2 levels "Male","Female": 2 2 1 1 1 1 1 2 2 1 ...
  • You can also use dim, nrow, ncol, rownames, colnames, head, and tail functions to explore the structure of the data.
# R code chunk is not evaluated.

dim(data); nrow(data); ncol(data)
rownames(data); colnames(data)
head(data); tail(data)

Descriptive Statistics

Descriptive Statistics

  • The descriptive statistics of the modified data can be obtained with some built-in R functions.
  • Let's get the descriptive statistics for the Wage variable only.
# R code chunk is not evaluated.

min(data$Wage); max(data$Wage) 
mean(data$Wage); median(data$Wage)
var(data$Wage); sd(data$Wage)
range(data$Wage)
skewness(data$Wage); kurtosis(data$Wage) ## moments and e1071 packages.
  • Loop functions such as sapply can be used to compute a descriptive statistic for all variables at once.
sapply(data, function(col) mean(col))
#>          Wage       Ln.Wage     Education    Experience Experience.Sq 
#>      5.896103      1.623268     12.562738     17.017110    473.435361 
#>        Tenure        Gender 
#>      5.104563            NA
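The NA for Gender appears because mean is undefined for a factor. One way around it (a sketch on a stand-in data frame mixing numeric and factor columns) is to restrict the loop to the numeric columns with Filter:

```r
## Stand-in frame with one numeric and one factor column (illustrative).
df <- data.frame(Wage = c(3.10, 5.30, 6.00),
                 Gender = factor(c("Male", "Female", "Male")))
sapply(Filter(is.numeric, df), mean)  ## Means of the numeric columns only.
```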

Descriptive Statistics

base Package

summary(data)
#>       Wage           Ln.Wage          Education       Experience   
#>  Min.   : 0.530   Min.   :-0.6349   Min.   : 0.00   Min.   : 1.00  
#>  1st Qu.: 3.330   1st Qu.: 1.2030   1st Qu.:12.00   1st Qu.: 5.00  
#>  Median : 4.650   Median : 1.5369   Median :12.00   Median :13.50  
#>  Mean   : 5.896   Mean   : 1.6233   Mean   :12.56   Mean   :17.02  
#>  3rd Qu.: 6.880   3rd Qu.: 1.9286   3rd Qu.:14.00   3rd Qu.:26.00  
#>  Max.   :24.980   Max.   : 3.2181   Max.   :18.00   Max.   :51.00  
#>  Experience.Sq        Tenure          Gender   
#>  Min.   :   1.0   Min.   : 0.000   Male  :274  
#>  1st Qu.:  25.0   1st Qu.: 0.000   Female:252  
#>  Median : 182.5   Median : 2.000               
#>  Mean   : 473.4   Mean   : 5.105               
#>  3rd Qu.: 676.0   3rd Qu.: 7.000               
#>  Max.   :2601.0   Max.   :44.000

Descriptive Statistics

pastecs Package

knitr::kable(stat.desc(data)[-c(1:3, 7, 10:11, 14), ], align = "c", digits = 3)
               Wage    Ln.Wage   Education  Experience  Experience.Sq   Tenure   Gender
min           0.530     -0.635       0.000       1.000          1.000    0.000       NA
max          24.980      3.218      18.000      51.000       2601.000   44.000       NA
range        24.450      3.853      18.000      50.000       2600.000   44.000       NA
median        4.650      1.537      12.000      13.500        182.500    2.000       NA
mean          5.896      1.623      12.563      17.017        473.435    5.105       NA
var          13.639      0.283       7.667     184.204     379511.161   52.193       NA
std.dev       3.693      0.532       2.769      13.572        616.045    7.224       NA

Descriptive Statistics

psych Package

knitr::kable(as.data.frame(describe(data))[, -1], align = "c", digits = 3)
                 n     mean       sd    median  trimmed      mad     min       max     range    skew  kurtosis      se
Wage           526    5.896    3.693     4.650    5.244    2.372   0.530    24.980    24.450   2.002     4.940   0.161
Ln.Wage        526    1.623    0.532     1.537    1.589    0.531  -0.635     3.218     3.853   0.390     0.374   0.023
Education      526   12.563    2.769    12.000   12.692    1.483   0.000    18.000    18.000  -0.618     1.866   0.121
Experience     526   17.017   13.572    13.500   15.640   14.085   1.000    51.000    50.000   0.705    -0.652   0.592
Experience.Sq  526  473.435  616.045   182.500  352.673  264.644   1.000  2601.000  2600.000   1.488     1.280  26.861
Tenure         526    5.105    7.224     2.000    3.493    2.965   0.000    44.000    44.000   2.104     4.629   0.315
Gender*        526    1.479    0.500     1.000    1.474    0.000   1.000     2.000     1.000   0.083    -1.997   0.022

Descriptive Statistics

stargazer Package

stargazer(data, type = "text", flip = TRUE, mean.sd = TRUE, min.max = TRUE, 
    iqr = TRUE, median = TRUE)  ## Use type = "html" for HTML output.
#> 
#> ==================================================================
#> Statistic  Wage  Ln.Wage Education Experience Experience.Sq Tenure
#> ------------------------------------------------------------------
#> N          526     526      526       526          526       526  
#> Mean      5.896   1.623   12.563     17.017      473.435    5.105 
#> St. Dev.  3.693   0.532    2.769     13.572      616.045    7.224 
#> Min       0.530  -0.635      0         1            1         0   
#> Pctl(25)  3.330   1.203     12         5           25         0   
#> Median    4.650   1.537     12        13.5        182.5       2   
#> Pctl(75)  6.880   1.929     14         26          676        7   
#> Max       24.980  3.218     18         51         2,601       44  
#> ------------------------------------------------------------------

Exploratory Data Analysis

Histogram

base Package

hist(data$Wage)

Histogram

base Package

hist(data$Wage, col = "lightblue", breaks = 50, xlim = c(0, max(data$Wage)), 
    ylim = c(0, 80), xlab = "Wage", main = "Histogram of Wage")
rug(data$Wage)  ## Plots all of the points under the histogram.
abline(v = mean(data$Wage), col = "red", lwd = 1, lty = 2)  ## Vertical line.

Density Plot

base Package

hist(data$Wage, col = "red", breaks = 100, xlim = c(0, max(data$Wage)), ylim = c(0, 
    0.7), freq = FALSE, density = 10, xlab = "Wage", main = "Density of Wage")
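A kernel density curve can also be overlaid on the frequency-scaled histogram with density and lines. A sketch, with simulated values standing in for data$Wage so the chunk runs on its own:

```r
## Simulated stand-in for data$Wage (illustrative only).
set.seed(1)
wage <- rexp(526, rate = 1/6)
hist(wage, breaks = 50, freq = FALSE, col = "lightblue",
     xlab = "Wage", main = "Density of Wage")
lines(density(wage), col = "red", lwd = 2)  ## Kernel density estimate.
```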

Bar Plot

base Package

barplot(table(data$Education), col = "wheat", xlab = "Education", ylab = "Number of Employees", 
    main = "Number of Employees by Education")

Bar Plot

base Package

barplot(table(data$Gender, data$Education), col = c("blue", "red"), xlab = "Education", 
    ylab = "Number of Employees", main = "Number of Employees by Gender and Education")
legend("topright", pch = 15, col = c("blue", "red"), legend = c("Male", "Female"))

Box Plot

base Package

boxplot(data$Wage ~ data$Gender, col = c("blue", "red"), names = c("Male", "Female"), 
    ylab = "Wage", main = "Box Plots of Wage for Male and Female")

Scatter Plot

base Package

plot(x = data$Education, y = data$Wage, axes = FALSE, col = "red", bg = "green", 
    cex = 0.75, pch = 21, xlab = "Education", ylab = "Wage", main = "Relationship Between Wage and Education")
axis(side = 1, at = c(0:max(data$Education)))
axis(side = 2)
box()

Scatter Plot Matrix

base Package

plot(data[, c("Wage", "Education", "Experience")], cex = 0.75)

Line Plot

base Package

  • For the line plot, we need time-series data. We will use a built-in R time series called AirPassengers, which contains the monthly airline passenger totals for 1949-1960.
plot(AirPassengers, type = "l", ylab = "# of Passengers")

Interactive Plot

ggplot2 and plotly Packages


Interactive Thematic Map

plotly Package

Interactive Map

leaflet Package

Animation

ggplot2, plotly and gapminder Packages

Linear Regression

Simple Linear Regression

Model Estimation

  • In R, you need to use the lm function for linear regressions.
model <- lm(formula = data$Wage ~ data$Education, singular.ok = FALSE) ## Works, but prints variable names with the data$ prefix.
model <- lm(data = data, formula = Wage ~ Education, singular.ok = FALSE) ## Better for printing.
  • For quick model estimation results, use the attributes of the model object.
# R code chunk is not evaluated.

attributes(model) ## Gives the model attributes for quick model estimation results.
model$coefficients; coef(model) ## Coefficients.
confint(model) ## Confidence interval of coefficients.
model$residuals; residuals(model) ## Residuals.
model$df.residual ## Degrees of freedom for residuals.
model$fitted.values; fitted(model) ## Fitted values.

Simple Linear Regression

Model Summary

summary(model) ## Prints detailed model estimation results.
#> 
#> Call:
#> lm(formula = Wage ~ Education, data = data, singular.ok = FALSE)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.3396 -2.1501 -0.9674  1.1921 16.6085 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -0.90485    0.68497  -1.321    0.187    
#> Education    0.54136    0.05325  10.167   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.378 on 524 degrees of freedom
#> Multiple R-squared:  0.1648, Adjusted R-squared:  0.1632 
#> F-statistic: 103.4 on 1 and 524 DF,  p-value: < 2.2e-16

Simple Linear Regression

Model Summary

  • Use the attributes of the model summary object to get quick model estimation results.
# R code chunk is not evaluated.

model.sum <- summary(model)
str(model.sum) ## Structure of model summary object.
model.sum$coefficients ## Coefficients.
model.sum$r.squared ## R squared.
model.sum$adj.r.squared ## Adjusted R squared.
model.sum$sigma ## Sigma (residual standard error).
model.sum$cov.unscaled ## Variance-covariance matrix.
model.sum$df ## Degrees of freedom.
model.sum$fstatistic ## F-statistics of the model.

Simple Linear Regression

Model Summary

stargazer(model, type = "text", style = "qje")  ## For a nice tabulated printing, use stargazer package.
#> 
#> ==========================================================
#>                                      Wage                 
#> ----------------------------------------------------------
#> Education                          0.541***               
#>                                    (0.053)                
#>                                                           
#> Constant                            -0.905                
#>                                    (0.685)                
#>                                                           
#> N                                    526                  
#> R2                                  0.165                 
#> Adjusted R2                         0.163                 
#> Residual Std. Error            3.378 (df = 524)           
#> F Statistic                103.363*** (df = 1; 524)       
#> ==========================================================
#> Notes:              ***Significant at the 1 percent level.
#>                      **Significant at the 5 percent level.
#>                      *Significant at the 10 percent level.

Simple Linear Regression

Estimated Regression Line

with(data, plot(Education, Wage, main = "Estimated Regression Line"))
abline(model, col = "red")
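Beyond the fitted line, predict returns point predictions with confidence intervals for new education levels. A sketch on simulated data so the chunk runs standalone (the real model is fit on the wage data above):

```r
## Simulated stand-ins for Wage and Education (illustrative only).
set.seed(1)
educ <- sample(0:18, 100, replace = TRUE)
wage <- -0.9 + 0.54 * educ + rnorm(100, sd = 3)
model <- lm(wage ~ educ)
new <- data.frame(educ = c(12, 16))
predict(model, newdata = new, interval = "confidence")  ## fit, lwr, upr.
```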

Simple Linear Regression

ANOVA

  • For the analysis of variance (ANOVA), use the anova function.
print.data.frame(anova(model))
#>            Df   Sum Sq    Mean Sq  F value       Pr(>F)
#> Education   1 1179.732 1179.73205 103.3627 2.782597e-22
#> Residuals 524 5980.682   11.41352       NA           NA

Simple Linear Regression

Model Diagnosis

par(mfrow = c(2,2)) # Change the panel layout to 2 x 2.
plot(model)

par(mfrow = c(1,1)) # Change back to 1 x 1.

Simple Linear Regression

Model Diagnosis

  • Some important R functions for model diagnosis are listed below.
# R code chunk is not evaluated.

plot(gvlma(model))  ## Model diagnosis plots (from gvlma package).
summary(gvlma(model))  ## Checks all the linear regression assumptions (from gvlma package).
gvlma(model)  ## Similar to above but without model estimation results.

vif(model)  ## Multicollinearity check (from car package).
outlierTest(model)  ## Bonferroni p-values for the most extreme observations (from car package).
qqPlot(model, main = "QQ Plot")  ## Normality of residuals (QQ plot of studentized residuals, from car package).
shapiro.test(residuals(model))  ## Normality check with Shapiro-Wilk test.
ad.test(residuals(model))  ## Normality check with Anderson-Darling test (from nortest package).

ncvTest(model)  ## Non-constant error variance test (homoskedasticity, from car package).
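Most of the functions above come from add-on packages (car, nortest, gvlma). A fully base R diagnostic check that runs as-is, sketched on the built-in mtcars data rather than the wage model:

```r
## Base R only, on the built-in mtcars data (illustrative model).
m <- lm(mpg ~ wt, data = mtcars)
shapiro.test(residuals(m))                  ## Shapiro-Wilk normality test.
qqnorm(residuals(m)); qqline(residuals(m))  ## Normal Q-Q plot of residuals.
```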

Multiple Linear Regression

Model Estimation

  • You can run a multiple linear regression by adding other variables to the formula argument of the lm function with the + sign.
model <- lm(data = data, formula = Wage ~ Education + Experience + Tenure, singular.ok = FALSE)
stargazer(model, type = "html", style = "qje")  ## For a nice tabulated printing, use stargazer package.
==========================================================
                                     Wage                 
----------------------------------------------------------
Education                          0.599***               
                                   (0.051)                
Experience                          0.022*                
                                   (0.012)                
Tenure                             0.169***               
                                   (0.022)                
Constant                           -2.873***              
                                   (0.729)                
N                                    526                  
R2                                  0.306                 
Adjusted R2                         0.302                 
Residual Std. Error            3.084 (df = 522)           
F Statistic                76.873*** (df = 3; 522)        
==========================================================
Notes:              ***Significant at the 1 percent level.
                     **Significant at the 5 percent level.
                      *Significant at the 10 percent level.

Multiple Linear Regression

Dummy Variables

  • You can add a dummy variable to a linear regression like other explanatory variables.
  • The only important point is that you need to know the base level of the dummy variable.
    • In our modified data frame, Male is the base level of the dummy variable Gender.
    • You can change the base level of a dummy variable by re-defining the base level with ref argument in the relevel function, e.g., relevel(data$Gender, ref = "Female").
model.1 <- lm(data = data, formula = Wage ~ Education + Experience + Gender, 
    singular.ok = FALSE)  ## Base level: Male.

data$Gender <- relevel(data$Gender, ref = "Female")

model.2 <- lm(data = data, formula = Wage ~ Education + Experience + Gender, 
    singular.ok = FALSE)  ## Base level: Female.

data$Gender <- relevel(data$Gender, ref = "Male")  ## Revert the base level back to Male.

Multiple Linear Regression

Dummy Variables

stargazer(model.1, model.2, type = "html", style = "all2")
=========================================================================
                                      Dependent variable: Wage           
                               ------------------------------------------
                                       (1)                   (2)         
-------------------------------------------------------------------------
Education                            0.603***              0.603***      
                                     (0.051)               (0.051)       
Experience                           0.064***              0.064***      
                                     (0.010)               (0.010)       
GenderFemale                        -2.156***                            
                                     (0.270)                             
GenderMale                                                 2.156***      
                                                           (0.270)       
Constant                            -1.734**              -3.890***      
                                     (0.754)               (0.727)       
Observations                           526                   526         
R2                                    0.309                 0.309        
Adjusted R2                           0.305                 0.305        
Residual Std. Error (df = 522)        3.078                 3.078        
F Statistic (df = 3; 522)      77.920*** (p = 0.000) 77.920*** (p = 0.000)
=========================================================================
Note: *p<0.1; **p<0.05; ***p<0.01

Multiple Linear Regression

Squared Variables

  • While using squared variables, you need to be very careful: inside a formula the ^ operator has a special meaning, so a squared term must either be wrapped in I() or pre-computed as a separate column.
model.1 <- lm(data = data, formula = Wage ~ Experience + Experience^2) ## Not correct: Experience^2 reduces to Experience.
model.2 <- lm(data = data, formula = Wage ~ Experience + I(data$Experience^2)) ## Correct.
model.3 <- lm(data = data, formula = Wage ~ Experience + Experience.Sq) ## Correct.
=========================================================================================================
                                            Dependent variable: Wage
                      -----------------------------------------------------------------------------------
                                (1)                        (2)                         (3)
---------------------------------------------------------------------------------------------------------
Experience                    0.031***                   0.298***                    0.298***
                              (0.012)                    (0.041)                     (0.041)
I(data$Experience^2)                                    -0.006***
                                                         (0.001)
Experience.Sq                                                                       -0.006***
                                                                                     (0.001)
Constant                      5.373***                   3.725***                    3.725***
                              (0.257)                    (0.346)                     (0.346)
Observations                    526                        526                         526
R2                             0.013                      0.093                       0.093
Adjusted R2                    0.011                      0.089                       0.089
Residual Std. Error       3.673 (df = 524)           3.524 (df = 523)            3.524 (df = 523)
F Statistic          6.766*** (df = 1; 524)      26.740*** (df = 2; 523)     26.740*** (df = 2; 523)
                          (p = 0.010)                 (p = 0.000)                 (p = 0.000)
=========================================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
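An equivalent, often cleaner way to add polynomial terms is poly with raw = TRUE, which keeps the quadratic inside the formula without I(). A sketch on simulated data (the variable names are stand-ins for the wage data):

```r
## Simulated stand-ins for Wage and Experience (illustrative only).
set.seed(1)
exper <- runif(200, 1, 50)
wage <- 3.7 + 0.3 * exper - 0.006 * exper^2 + rnorm(200)
m.I    <- lm(wage ~ exper + I(exper^2))            ## Explicit I() wrapper.
m.poly <- lm(wage ~ poly(exper, 2, raw = TRUE))    ## Raw polynomial terms.
all.equal(unname(coef(m.I)), unname(coef(m.poly))) ## Identical fits.
```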

Multiple Linear Regression

Model Comparison

  • Finally, let's run several multiple linear regressions with different specifications and compare them.
model.1 <- lm(data = data, formula = Ln.Wage ~ log(Experience), singular.ok = FALSE)

model.2 <- lm(data = data, formula = Wage ~ Education + Experience + Tenure, singular.ok = FALSE)

model.3 <- lm(data = data, formula = Ln.Wage ~ Education + Experience + Tenure, singular.ok = FALSE)

model.4 <- lm(data = data, formula = Wage ~ Education + Experience + Tenure + Gender, singular.ok = FALSE)
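Nested specifications like model.2 and model.4 (which only adds Gender) can also be compared formally with an F-test via anova, and with information criteria via AIC. A sketch on simulated stand-ins for the wage variables so the chunk is self-contained:

```r
## Simulated stand-ins for the wage variables (illustrative only).
set.seed(1)
n <- 526
educ   <- sample(0:18, n, replace = TRUE)
exper  <- runif(n, 1, 51)
female <- rbinom(n, 1, 0.5)
wage   <- -1.6 + 0.57 * educ + 0.03 * exper - 1.8 * female + rnorm(n, sd = 3)
m.small <- lm(wage ~ educ + exper)           ## Restricted model.
m.big   <- lm(wage ~ educ + exper + female)  ## Adds the dummy variable.
anova(m.small, m.big)  ## F-test for the added regressor.
AIC(m.small, m.big)    ## Lower AIC indicates the better-fitting model.
```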

Multiple Linear Regression

Model Comparison

=============================================================================================================
                                                  Dependent variable:
                     ----------------------------------------------------------------------------------------
                          Ln.Wage                 Wage                  Ln.Wage                 Wage
                            (1)                    (2)                    (3)                    (4)
-------------------------------------------------------------------------------------------------------------
log(Experience)          0.117***
                         (0.021)
Education                                        0.599***               0.092***               0.572***
                                                 (0.051)                (0.007)                (0.049)
Experience                                       0.022*                 0.004**                0.025**
                                                 (0.012)                (0.002)                (0.012)
Tenure                                           0.169***               0.022***               0.141***
                                                 (0.022)                (0.003)                (0.021)
GenderFemale                                                                                  -1.811***
                                                                                               (0.265)
Constant                 1.343***               -2.873***               0.284***              -1.568**
                         (0.056)                 (0.729)                (0.104)                (0.725)
Observations               526                     526                    526                    526
R2                        0.055                   0.306                  0.316                  0.364
Adjusted R2               0.053                   0.302                  0.312                  0.359
Residual Std. Error  0.517 (df = 524)       3.084 (df = 522)       0.441 (df = 522)       2.958 (df = 521)
F Statistic         30.421*** (df = 1; 524) 76.873*** (df = 3; 522) 80.391*** (df = 3; 522) 74.398*** (df = 4; 521)
                       (p = 0.00000)           (p = 0.000)            (p = 0.000)            (p = 0.000)
=============================================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01

Thank You